Introduction

For this project, we examine the stock prices of four major tech companies: Apple, Facebook, Amazon and Google. The records start in 1980, 2012, 1997 and 2004, respectively, and run through 2018. The data, downloaded from Kaggle (*https://www.kaggle.com/stexo92/gafa-stock-prices*), contain the date, opening price, closing price, trading volume, daily high and low prices, and adjusted closing price for each company.

Given the nature of the stock market, stock prices are likely to have temporal structure that matters for both analysis and prediction. We take the closing price as the response variable and fit models to determine what affects each company's closing price and to understand the temporal structure within it.

We will first carry out an exploratory data analysis for all four companies, then fit simpler models with no temporal structure, and finally fit and evaluate temporal models for each company. For the temporal models, we will try two different methods: automatically fitted ARIMA models and Gaussian process models. We will then compare the performance of all three types of models: the non-temporal baseline and the temporal models fit by the two methods.

Lastly, we wish to answer our questions of interest. What factors can potentially affect the closing prices of stocks? Is there any temporal dependency in the closing prices? Do the companies differ, or do their stocks share similar trends and structures? We will also discuss the adequacy of the models and their potential problems, and suggest directions for developing this project further.

EDA

summary(df)
##       Stock           Date                 Open              High        
##  Amazon  :1497   Min.   :2012-05-09   Min.   :  18.08   Min.   :  18.27  
##  Apple   :1497   1st Qu.:2013-11-05   1st Qu.:  97.60   1st Qu.:  98.60  
##  Facebook:1490   Median :2015-05-04   Median : 212.14   Median : 215.35  
##  Google  :1497   Mean   :2015-05-02   Mean   : 351.09   Mean   : 354.08  
##                  3rd Qu.:2016-10-25   3rd Qu.: 552.10   3rd Qu.: 555.26  
##                  Max.   :2018-04-20   Max.   :1615.96   Max.   :1617.54  
##       Low              Close           Adj.Close           Volume         
##  Min.   :  17.55   Min.   :  17.73   Min.   :  17.73   Min.   :     7900  
##  1st Qu.:  96.58   1st Qu.:  97.50   1st Qu.:  94.30   1st Qu.:  2745600  
##  Median : 207.75   Median : 212.91   Median : 212.91   Median : 10815000  
##  Mean   : 347.75   Mean   : 351.06   Mean   : 349.02   Mean   : 26572272  
##  3rd Qu.: 545.33   3rd Qu.: 551.97   3rd Qu.: 551.97   3rd Qu.: 37140600  
##  Max.   :1590.89   Max.   :1598.39   Max.   :1598.39   Max.   :573576400  
##       diff          
##  Min.   :-79.18005  
##  1st Qu.: -1.34998  
##  Median :  0.01001  
##  Mean   : -0.02581  
##  3rd Qu.:  1.46000  
##  Max.   : 81.38000
ggplot(data = df, aes(x = Stock, y = Close)) +
  geom_boxplot() +
  ggtitle("Close Price Distribution by Stock")

p1 = ggplot(data = df %>% filter(Stock == "Amazon") %>% arrange(Date), aes(x = Date, y = Close)) +
  geom_line() +
  ggtitle("Stock Close Price of Amazon")
p2 = ggplot(data = df %>% filter(Stock == "Apple") %>% arrange(Date), aes(x = Date, y = Close)) +
  geom_line() +
  ggtitle("Stock Close Price of Apple")
p3 = ggplot(data = df %>% filter(Stock == "Google") %>% arrange(Date), aes(x = Date, y = Close)) +
  geom_line() +
  ggtitle("Stock Close Price of Google")
p4 = ggplot(data = df %>% filter(Stock == "Facebook") %>% arrange(Date), aes(x = Date, y = Close)) +
  geom_line() +
  ggtitle("Stock Close Price of Facebook")
grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)

p1 = ggplot(data = df %>% filter(Stock == "Amazon") %>% arrange(Date), aes(x = Date, y = Open)) +
  geom_line() +
  ggtitle("Stock Open Price of Amazon")
p2 = ggplot(data = df %>% filter(Stock == "Apple") %>% arrange(Date), aes(x = Date, y = Open)) +
  geom_line() +
  ggtitle("Stock Open Price of Apple")
p3 = ggplot(data = df %>% filter(Stock == "Google") %>% arrange(Date), aes(x = Date, y = Open)) +
  geom_line() +
  ggtitle("Stock Open Price of Google")
p4 = ggplot(data = df %>% filter(Stock == "Facebook") %>% arrange(Date), aes(x = Date, y = Open)) +
  geom_line() +
  ggtitle("Stock Open Price of Facebook")
grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)

p1 = ggplot(data = df %>% filter(Stock == "Amazon") %>% arrange(Date), aes(x = Date, y = diff)) +
  geom_line() +
  ggtitle("Stock Price Daily Change of Amazon")
p2 = ggplot(data = df %>% filter(Stock == "Apple") %>% arrange(Date), aes(x = Date, y = diff)) +
  geom_line() +
  ggtitle("Stock Price Daily Change of Apple")
p3 = ggplot(data = df %>% filter(Stock == "Google") %>% arrange(Date), aes(x = Date, y = diff)) +
  geom_line() +
  ggtitle("Stock Price Daily Change of Google")
p4 = ggplot(data = df %>% filter(Stock == "Facebook") %>% arrange(Date), aes(x = Date, y = diff)) +
  geom_line() +
  ggtitle("Stock Price Daily Change of Facebook")
grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)

ggplot(data = cor(df %>% select(Close, Open, Volume, diff, High, Low)) %>% reshape2::melt(), aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile() +
  ggtitle("Correlation matrix")

naive.lm = lm(data = df, formula = Close ~ as.factor(Stock) + Date)
naive.lm %>% summary()
## 
## Call:
## lm(formula = Close ~ as.factor(Stock) + Date, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -253.24 -105.74  -17.71   78.96  787.50 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -3.325e+03  4.643e+01  -71.62   <2e-16 ***
## as.factor(Stock)Apple    -4.571e+02  4.951e+00  -92.33   <2e-16 ***
## as.factor(Stock)Facebook -4.759e+02  4.956e+00  -96.02   <2e-16 ***
## as.factor(Stock)Google    7.450e+01  4.951e+00   15.05   <2e-16 ***
## Date                      2.350e-01  2.796e-03   84.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 135.4 on 5976 degrees of freedom
## Multiple R-squared:  0.8238, Adjusted R-squared:  0.8237 
## F-statistic:  6984 on 4 and 5976 DF,  p-value: < 2.2e-16
plot(shuffle(naive.lm$residuals), main = "Naive Residual Plot", ylab = "Residual")

df = df %>% mutate(naive.pred = naive.lm$fitted.values)
rmse(df$Close, df$naive.pred)
## [1] 135.3863
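
The RMSE above (135.3863) is close to, but not identical to, the residual standard error reported by `summary()` (135.4). A minimal sketch of why, assuming `rmse()` computes the usual root mean squared error; the data here are synthetic and the names `toy`, `toy.lm`, `rmse.manual` and `rse` are made up for illustration:

```r
# Minimal sketch on synthetic data (not the GAFA data): RMSE computed by hand.
set.seed(1)
toy <- data.frame(x = 1:50)
toy$y <- 2 + 0.5 * toy$x + rnorm(50, sd = 3)

toy.lm <- lm(y ~ x, data = toy)

# Root mean squared error: square root of the mean squared residual
rmse.manual <- sqrt(mean((toy$y - fitted(toy.lm))^2))

# The residual standard error from summary() divides the residual sum of
# squares by the residual degrees of freedom (n - 2 here) instead of n,
# so it is slightly larger than the RMSE
rse <- summary(toy.lm)$sigma
```

With 5981 observations and 5976 residual degrees of freedom, as in the naive model, the two quantities differ only in the later digits.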

Time Series Analysis and Models

\[ \begin{aligned} Close_t &= ARIMA_{(p, d, q) \times (P, D, Q)_s}(Close_{t-1}, Close_{t-2}, \cdots) + \beta_s \end{aligned} \]
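
For reference, the non-seasonal part of the ARIMA notation above can be written out with the backshift operator \(B\), where \(B \, Close_t = Close_{t-1}\). This is the standard textbook form; the symbols \(\phi_i\), \(\theta_j\), \(c\) and \(\varepsilon_t\) are introduced here for illustration and do not appear elsewhere in this report:

\[ \begin{aligned} \left(1 - \phi_1 B - \cdots - \phi_p B^p\right)(1 - B)^d \, Close_t &= c + \left(1 + \theta_1 B + \cdots + \theta_q B^q\right)\varepsilon_t \end{aligned} \]

Here \(p\) is the autoregressive order, \(d\) the number of differences, \(q\) the moving-average order, and \(\varepsilon_t\) a white-noise error term.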

amazon.stock = df %>% filter(Stock == "Amazon") %>% arrange(Date)
ggtsdisplay(amazon.stock %>% select(Close), main = "Difference of Price")

ts.model = auto.arima(amazon.stock %>% select(Close), seasonal = TRUE)
ggtsdisplay(ts.model$residuals, main = "Arima(2,2,0)")

ts.model %>% summary()
## Series: amazon.stock %>% select(Close) 
## ARIMA(2,2,0) 
## 
## Coefficients:
##           ar1      ar2
##       -0.6589  -0.3277
## s.e.   0.0246   0.0246
## 
## sigma^2 estimated as 172.1:  log likelihood=-5968.77
## AIC=11943.54   AICc=11943.55   BIC=11959.47
## 
## Training set error measures:
##                        ME     RMSE      MAE          MPE     MAPE     MASE
## Training set -0.005769807 13.10128 8.084945 -0.002588783 1.493505 1.193601
##                     ACF1
## Training set -0.08137831
amazon.stock = amazon.stock %>% mutate(ts.residuals = ts.model$residuals)
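
Written out, the ARIMA(2,2,0) fit reported above corresponds to the following equation, assuming the usual R sign convention for the `ar1` and `ar2` coefficients; \(B\) is the backshift operator and \(\varepsilon_t\) the error term, both notation introduced here for illustration:

\[ \begin{aligned} \left(1 + 0.6589\,B + 0.3277\,B^2\right)(1 - B)^2 \, Close_t &= \varepsilon_t, \qquad \hat{\sigma}^2 \approx 172.1 \end{aligned} \]

Equivalently, with \(\Delta^2 Close_t = Close_t - 2\,Close_{t-1} + Close_{t-2}\), the twice-differenced series follows \(\Delta^2 Close_t = -0.6589\,\Delta^2 Close_{t-1} - 0.3277\,\Delta^2 Close_{t-2} + \varepsilon_t\).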

res.lm.model = lm(data = amazon.stock, formula = ts.residuals ~ 1)
res.lm.model %>% summary()
## 
## Call:
## lm(formula = ts.residuals ~ 1, data = amazon.stock)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.134  -5.187  -0.392   5.086 126.534 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.00577    0.33873  -0.017    0.986
## 
## Residual standard error: 13.11 on 1496 degrees of freedom
amazon.stock = amazon.stock %>% 
  mutate(ts.pred = ts.model$fitted + res.lm.model$fitted.values)
amazon.stock$residuals = c(amazon.stock$Close - amazon.stock$ts.pred)

ggplot(data = amazon.stock, aes(x = Date)) +
  geom_line(aes(y = Close, color = "Actual")) +
  geom_line(aes(y = ts.pred, color = "ARIMA fit")) +
  xlab("time") +
  ylab("Close Price of Amazon Stock") +
  ggtitle("ARIMA(2,2,0)")

rmse(amazon.stock$Close, amazon.stock$naive.pred)
## [1] 203.9826
rmse(amazon.stock$Close, amazon.stock$ts.pred)
## [1] 13.10128
forecast(ts.model, 30) %>% plot(xlim=c(1250, 1520))
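
The RMSE values above are computed on the same observations used for fitting, so they describe in-sample fit, not forecast accuracy. A hedged sketch of a holdout check follows. It uses a simulated series and base R `arima()` with an arbitrarily chosen order; both are assumptions for illustration and are not the data or the models selected by `auto.arima()` in this report.

```r
# Sketch on a simulated series (not the GAFA data)
set.seed(42)
n <- 300
h <- 30                                   # horizon, as in forecast(ts.model, 30)
y <- cumsum(as.numeric(arima.sim(list(ar = 0.5), n = n)))

# Hold out the last h observations
train <- y[1:(n - h)]
test  <- y[(n - h + 1):n]

# Base R arima(); the order (1, 1, 0) is an illustrative choice
fit <- arima(train, order = c(1, 1, 0))
fc  <- predict(fit, n.ahead = h)

# Out-of-sample RMSE over the held-out observations
holdout.rmse <- sqrt(mean((test - as.numeric(fc$pred))^2))
```

Comparing `holdout.rmse` across candidate models would give a fairer picture of predictive performance than the training-set RMSE alone.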

plot(amazon.stock$residuals, ylab = "Residuals", main = "Residual Plot of ARIMA")
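
A residual plot can be complemented by a formal check for leftover autocorrelation; the Amazon summary above reports a lag-1 residual autocorrelation (ACF1) of about -0.08. The sketch below runs a Ljung-Box test with base R `Box.test()` on a simulated series. The data, the model order and the lag of 10 are assumptions chosen for illustration, not settings taken from this report.

```r
# Sketch on simulated data (not the GAFA data)
set.seed(7)
y <- cumsum(as.numeric(arima.sim(list(ar = 0.5), n = 500)))

fit <- arima(y, order = c(1, 1, 0))
res <- residuals(fit)

# Ljung-Box test on the first 10 residual autocorrelations;
# fitdf = 1 accounts for the single estimated AR coefficient
lb <- Box.test(res, lag = 10, type = "Ljung-Box", fitdf = 1)

# A small p-value would suggest autocorrelation left in the residuals
lb$p.value
```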

apple.stock = df %>% filter(Stock == "Apple") %>% arrange(Date)
ggtsdisplay(apple.stock %>% select(Close), main = "Difference of Price")

ts.model = auto.arima(apple.stock %>% select(Close), seasonal = TRUE)
ggtsdisplay(ts.model$residuals, main = "Arima(2,1,2)")

ts.model %>% summary()
## Series: apple.stock %>% select(Close) 
## ARIMA(2,1,2) 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2
##       -0.6752  -0.9108  0.6999  0.8846
## s.e.   0.0493   0.0537  0.0567  0.0575
## 
## sigma^2 estimated as 2.656:  log likelihood=-2851.39
## AIC=5712.79   AICc=5712.83   BIC=5739.34
## 
## Training set error measures:
##                      ME     RMSE      MAE       MPE     MAPE     MASE
## Training set 0.05674586 1.626949 1.149444 0.0358611 1.099604 1.000475
##                    ACF1
## Training set 0.01167058
apple.stock = apple.stock %>% mutate(ts.residuals = ts.model$residuals)

res.lm.model = lm(data = apple.stock, formula = ts.residuals ~ 1)
res.lm.model %>% summary()
## 
## Call:
## lm(formula = ts.residuals ~ 1, data = apple.stock)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7589 -0.7590  0.0060  0.8808  7.5666 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.05675    0.04204    1.35    0.177
## 
## Residual standard error: 1.627 on 1496 degrees of freedom
apple.stock = apple.stock %>% 
  mutate(ts.pred = ts.model$fitted + res.lm.model$fitted.values)
apple.stock$residuals = c(apple.stock$Close - apple.stock$ts.pred)

ggplot(data = apple.stock, aes(x = Date)) +
  geom_line(aes(y = Close, color = "Actual")) +
  geom_line(aes(y = ts.pred, color = "ARIMA fit")) +
  xlab("time") +
  ylab("Close Price of Apple Stock") +
  ggtitle("ARIMA(2,1,2)")

rmse(apple.stock$Close, apple.stock$naive.pred)
## [1] 121.063
rmse(apple.stock$Close, apple.stock$ts.pred)
## [1] 1.625959
forecast(ts.model, 30) %>% plot(xlim=c(1250, 1520))

plot(apple.stock$residuals, ylab = "Residuals", main = "Residual Plot of ARIMA")

google.stock = df %>% filter(Stock == "Google") %>% arrange(Date)
ggtsdisplay(google.stock %>% select(Close), main = "Difference of Price")

ts.model = auto.arima(google.stock %>% select(Close), seasonal = TRUE)
ggtsdisplay(ts.model$residuals, main = "Arima(1,1,1)")

ts.model %>% summary()
## Series: google.stock %>% select(Close) 
## ARIMA(1,1,1) with drift 
## 
## Coefficients:
##           ar1     ma1   drift
##       -0.6977  0.7459  0.5135
## s.e.   0.1781  0.1660  0.2491
## 
## sigma^2 estimated as 87.94:  log likelihood=-5469.76
## AIC=10947.52   AICc=10947.55   BIC=10968.76
## 
## Training set error measures:
##                        ME     RMSE     MAE         MPE      MAPE    MASE
## Training set 0.0009613606 9.365017 6.13327 -0.01538857 0.9656908 1.00196
##                      ACF1
## Training set -0.004049327
google.stock = google.stock %>% mutate(ts.residuals = ts.model$residuals)

res.lm.model = lm(data = google.stock, formula = ts.residuals ~ 1)
res.lm.model %>% summary()
## 
## Call:
## lm(formula = ts.residuals ~ 1, data = google.stock)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.552  -3.926  -0.184   4.216  91.539 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0009614  0.2421267   0.004    0.997
## 
## Residual standard error: 9.368 on 1496 degrees of freedom
google.stock = google.stock %>% 
  mutate(ts.pred = ts.model$fitted + res.lm.model$fitted.values)
google.stock$residuals = c(google.stock$Close - google.stock$ts.pred)

ggplot(data = google.stock, aes(x = Date)) +
  geom_line(aes(y = Close, color = "Actual")) +
  geom_line(aes(y = ts.pred, color = "ARIMA fit")) +
  xlab("time") +
  ylab("Close Price of Google Stock") +
  ggtitle("ARIMA(1,1,1)")

rmse(google.stock$Close, google.stock$naive.pred)
## [1] 84.95307
rmse(google.stock$Close, google.stock$ts.pred)
## [1] 9.365017
forecast(ts.model, 30) %>% plot(xlim=c(1250, 1520))

plot(google.stock$residuals, ylab = "Residuals", main = "Residual Plot of ARIMA")

facebook.stock = df %>% filter(Stock == "Facebook") %>% arrange(Date)
ggtsdisplay(facebook.stock %>% select(Close), main = "Difference of Price")

ts.model = auto.arima(facebook.stock %>% select(Close), seasonal = TRUE)
ggtsdisplay(ts.model$residuals, main = "Arima(2,1,2)")

ts.model %>% summary()
## Series: facebook.stock %>% select(Close) 
## ARIMA(2,1,2) with drift 
## 
## Coefficients:
##           ar1     ar2     ma1      ma2   drift
##       -0.0468  0.8310  0.0467  -0.8966  0.0882
## s.e.   0.0658  0.0661  0.0545   0.0550  0.0304
## 
## sigma^2 estimated as 2.819:  log likelihood=-2882.04
## AIC=5776.08   AICc=5776.14   BIC=5807.92
## 
## Training set error measures:
##                       ME     RMSE     MAE       MPE     MAPE     MASE
## Training set -0.00257673 1.675672 1.12037 -0.106676 1.516613 0.996834
##                    ACF1
## Training set 0.00453638
facebook.stock = facebook.stock %>% mutate(ts.residuals = ts.model$residuals)

res.lm.model = lm(data = facebook.stock, formula = ts.residuals ~ 1)
res.lm.model %>% summary()
## 
## Call:
## lm(formula = ts.residuals ~ 1, data = facebook.stock)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4620  -0.7034  -0.0093   0.8094  14.3003 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.002577   0.043425  -0.059    0.953
## 
## Residual standard error: 1.676 on 1489 degrees of freedom
facebook.stock = facebook.stock %>% 
  mutate(ts.pred = ts.model$fitted + res.lm.model$fitted.values)
facebook.stock$residuals = c(facebook.stock$Close - facebook.stock$ts.pred)

ggplot(data = facebook.stock, aes(x = Date)) +
  geom_line(aes(y = Close, color = "Actual")) +
  geom_line(aes(y = ts.pred, color = "ARIMA fit")) +
  xlab("time") +
  ylab("Close Price of Facebook Stock") +
  ggtitle("ARIMA(2,1,2)")

rmse(facebook.stock$Close, facebook.stock$naive.pred)
## [1] 98.9735
rmse(facebook.stock$Close, facebook.stock$ts.pred)
## [1] 1.67567
forecast(ts.model, 30) %>% plot(xlim=c(1250, 1520))

plot(facebook.stock$residuals, ylab = "Residuals", main = "Residual Plot of ARIMA")

Conclusion

Discussion